{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zzZbP0LM6m5z"
      },
      "source": [
        "# Extractive QA with Elasticsearch\n",
        "\n",
        "txtai is datastore agnostic, the library analyzes sets of text. The following example shows how extractive question-answering can be added on top of an Elasticsearch system."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xk7t5Jcd6reO"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and `Elasticsearch`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0y1UA4-q-YdA"
      },
      "source": [
        "%%capture\n",
        "\n",
        "# Install txtai and elasticsearch python client\n",
        "!pip install git+https://github.com/neuml/txtai elasticsearch\n",
        "\n",
        "# Download and extract elasticsearch\n",
        "!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz\n",
        "!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz\n",
        "!chown -R daemon:daemon elasticsearch-7.10.1"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nKWz-C5gCJy8"
      },
      "source": [
        "Start an instance of Elasticsearch directly within this notebook. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3ZfJeWbM6wmj"
      },
      "source": [
        "import os\n",
        "from subprocess import Popen, PIPE, STDOUT\n",
        "\n",
        "# If issues are encountered with this section, ES can be manually started as follows:\n",
        "# ./elasticsearch-7.10.1/bin/elasticsearch\n",
        "\n",
        "# Start and wait for server\n",
        "server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))\n",
        "!sleep 30"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TWEn4w68-D1y"
      },
      "source": [
        "# Download data\n",
        "\n",
        "This example is going to work off a subset of the [CORD-19](https://www.semanticscholar.org/cord19) dataset. COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.\n",
        "\n",
        "The following download is a SQLite database generated from a [Kaggle notebook](https://www.kaggle.com/davidmezzetti/cord-19-slim/output). More information on this data format, can be found in the [CORD-19 Analysis](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) notebook."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8tVrIqSq-KBa"
      },
      "source": [
        "%%capture\n",
        "!wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz\n",
        "!gunzip tests.gz\n",
        "!mv tests articles.sqlite"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hSWFzkCn61tM"
      },
      "source": [
        "# Load data into Elasticsearch\n",
        "\n",
        "The following block copies rows from SQLite to Elasticsearch."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "So-OBvUT61QD",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "9647b8f8-8471-41bf-ccfa-a75306665638"
      },
      "source": [
        "import sqlite3\n",
        "\n",
        "import regex as re\n",
        "\n",
        "from elasticsearch import Elasticsearch, helpers\n",
        "\n",
        "# Connect to ES instance\n",
        "es = Elasticsearch(hosts=[\"http://localhost:9200\"], timeout=60, retry_on_timeout=True)\n",
        "\n",
        "# Connection to database file\n",
        "db = sqlite3.connect(\"articles.sqlite\")\n",
        "cur = db.cursor()\n",
        "\n",
        "# Elasticsearch bulk buffer\n",
        "buffer = []\n",
        "rows = 0\n",
        "\n",
        "# Select tagged sentences without a NLP label. NLP labels are set for non-informative sentences.\n",
        "cur.execute(\"SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null\")\n",
        "for row in cur:\n",
        "  # Build dict of name-value pairs for fields\n",
        "  article = dict(zip((\"id\", \"article\", \"title\", \"published\", \"reference\", \"name\", \"text\"), row))\n",
        "  name = article[\"name\"]\n",
        "\n",
        "  # Only process certain document sections\n",
        "  if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
        "    # Bulk action fields\n",
        "    article[\"_id\"] = article[\"id\"]\n",
        "    article[\"_index\"] = \"articles\"\n",
        "\n",
        "    # Buffer article\n",
        "    buffer.append(article)\n",
        "\n",
        "    # Increment number of articles processed\n",
        "    rows += 1\n",
        "\n",
        "    # Bulk load every 1000 records\n",
        "    if rows % 1000 == 0:\n",
        "      helpers.bulk(es, buffer)\n",
        "      buffer = []\n",
        "\n",
        "      print(\"Inserted {} articles\".format(rows), end=\"\\r\")\n",
        "\n",
        "if buffer:\n",
        "  helpers.bulk(es, buffer)\n",
        "\n",
        "print(\"Total articles inserted: {}\".format(rows))\n"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Total articles inserted: 21499\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X5RO-VNwzMAo"
      },
      "source": [
        "# Query data\n",
        "\n",
        "The following runs a query against Elasticsearch for the terms \"risk factors\". It finds the top 5 matches and returns the corresponding documents associated with each match.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ucd9mwSfFTMm",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 348
        },
        "outputId": "b21d6aff-6abe-48f5-9914-7b7fb8472adb"
      },
      "source": [
        "import pandas as pd\n",
        "\n",
        "from IPython.display import display, HTML\n",
        "\n",
        "pd.set_option(\"display.max_colwidth\", None)\n",
        "\n",
        "query = {\n",
        "    \"_source\": [\"article\", \"title\", \"published\", \"reference\", \"text\"],\n",
        "    \"size\": 5,\n",
        "    \"query\": {\n",
        "        \"query_string\": {\"query\": \"risk factors\"}\n",
        "    }\n",
        "}\n",
        "\n",
        "results = []\n",
        "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n",
        "  source = result[\"_source\"]\n",
        "  results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]))\n",
        "\n",
        "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\"])\n",
        "\n",
        "display(HTML(df.to_html(index=False)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th>Title</th>\n",
              "      <th>Published</th>\n",
              "      <th>Reference</th>\n",
              "      <th>Match</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n",
              "      <td>2020-04-24 00:00:00</td>\n",
              "      <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n",
              "      <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n",
              "      <td>2020-04-27 00:00:00</td>\n",
              "      <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n",
              "      <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
              "      <td>2020-07-23 00:00:00</td>\n",
              "      <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
              "      <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n",
              "      <td>2020-03-15 00:00:00</td>\n",
              "      <td>https://doi.org/10.7150/ijbs.45134</td>\n",
              "      <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study</td>\n",
              "      <td>2020-05-11 00:00:00</td>\n",
              "      <td>http://medrxiv.org/cgi/content/short/2020.05.06.20092957v1?rss=1</td>\n",
              "      <td>In addition, many risk factors for covid-19 documented in the literature are highly correlated and it is not clear which may be independently related to risk.</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ylxOKji1-9_K"
      },
      "source": [
        "# Derive columns with Extractive QA\n",
        "\n",
        "The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions are asked of the document. The answers are added as a derived column per article."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mwBTrCkcOM_H"
      },
      "source": [
        "%%capture\n",
        "from txtai.embeddings import Embeddings\n",
        "from txtai.pipeline import Extractor\n",
        "\n",
        "# Create embeddings model, backed by sentence-transformers & transformers\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\"})\n",
        "\n",
        "# Create extractor instance using qa model designed for the CORD-19 dataset\n",
        "extractor = Extractor(embeddings, \"NeuML/bert-small-cord19qa\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Yv75Lh-cOpL9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 400
        },
        "outputId": "adee88e1-02bf-4a20-febb-6d2c170a63f9"
      },
      "source": [
        "document = {\n",
        "    \"_source\": [\"id\", \"name\", \"text\"],\n",
        "    \"size\": 1000,\n",
        "    \"query\": {\n",
        "        \"term\": {\"article\": None}\n",
        "    },\n",
        "    \"sort\" : [\"id\"]\n",
        "}\n",
        "\n",
        "def sections(article):\n",
        "  rows = []\n",
        "\n",
        "  search = document.copy()\n",
        "  search[\"query\"][\"term\"][\"article\"] = article\n",
        "\n",
        "  for result in es.search(index=\"articles\", body=search)[\"hits\"][\"hits\"]:\n",
        "    source = result[\"_source\"]\n",
        "    name, text = source[\"name\"], source[\"text\"]\n",
        "\n",
        "    if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
        "      rows.append(text)\n",
        "  \n",
        "  return rows\n",
        "\n",
        "results = []\n",
        "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n",
        "  source = result[\"_source\"]\n",
        "\n",
        "  # Use QA extractor to derive additional columns\n",
        "  answers = extractor([(\"Risk factors\", \"risk factor\", \"What are names of risk factors?\", False),\n",
        "                       (\"Locations\", \"city country state\", \"What are names of locations?\", False)], sections(source[\"article\"]))\n",
        "\n",
        "  results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]) + tuple([answer[1] for answer in answers]))\n",
        "\n",
        "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\", \"Risk Factors\", \"Locations\"])\n",
        "\n",
        "display(HTML(df.to_html(index=False)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th>Title</th>\n",
              "      <th>Published</th>\n",
              "      <th>Reference</th>\n",
              "      <th>Match</th>\n",
              "      <th>Risk Factors</th>\n",
              "      <th>Locations</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
              "      <td>2020-05-21 00:00:00</td>\n",
              "      <td>https://doi.org/10.1002/cpt.1910</td>\n",
              "      <td>Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .</td>\n",
              "      <td>sex, obesity, genetic factors and mechanical factors</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n",
              "      <td>2020-04-24 00:00:00</td>\n",
              "      <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n",
              "      <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n",
              "      <td>None</td>\n",
              "      <td>Abbott, Abbott Park, Illinois</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n",
              "      <td>2020-04-27 00:00:00</td>\n",
              "      <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n",
              "      <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n",
              "      <td>None</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
              "      <td>2020-07-23 00:00:00</td>\n",
              "      <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
              "      <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n",
              "      <td>Frailty and multimorbidity</td>\n",
              "      <td>comorbidity groupings and the corresponding health conditions</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n",
              "      <td>2020-03-15 00:00:00</td>\n",
              "      <td>https://doi.org/10.7150/ijbs.45134</td>\n",
              "      <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n",
              "      <td>Mandatory contact tracing and quarantine</td>\n",
              "      <td>cities, provinces, and countries</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    }
  ]
}
